 linear combination


More Expressive Feedforward Layers: Part I. Token-Adaptive Mixing of Activations

arXiv.org Machine Learning

Feedforward network (FFN) layers account for a large fraction of parameters and nonlinear expressivity in Transformer-based large language models (LLMs). Despite the evolution from ReLU and GELU to gated variants such as SwiGLU, most FFN designs still use a single fixed activation function, applying the same nonlinear transformation to all tokens. In this work, we propose Mixture of Activations (MoA), a token-adaptive FFN design that mixes a dictionary of activation functions using lightweight input-dependent gates while sharing the same linear projections. As an input-independent counterpart, we also introduce learnable activations (LA), which form linear combinations of activation functions for both ReLU-type and SwiGLU-type FFNs. Theoretically, we establish strict finite-width expressive separations among fixed-activation FFNs, LA, and MoA: LA strictly contains fixed-activation FFNs, while MoA strictly contains LA, with the additional expressivity arising from input-dependent nonlinear hybridization. Empirically, we evaluate MoA through extensive pre-training experiments on dense and MoE language models ranging from 0.12B to 2B parameters under different token budgets, optimizers, and learning rate schedules. MoA consistently achieves lower terminal loss and exhibits more favorable scaling behavior than well-tuned baselines, with minimal parameter and computational overhead. These results suggest that token-adaptive activation mixing is a simple and effective mechanism for improving FFN expressivity in LLMs.
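The distinction the abstract draws between LA (input-independent coefficients) and MoA (input-dependent gates over shared projections) can be illustrated with a minimal sketch for a ReLU-type FFN. The activation dictionary, the softmax gate, and the names `mixed_ffn`, `la_ffn`, `moa_ffn`, and `G` are assumptions made for illustration; the abstract does not specify the gate parameterization, so this is not the authors' implementation.

```python
import math

def relu(z):
    return max(0.0, z)

def gelu(z):
    return 0.5 * z * (1.0 + math.erf(z / math.sqrt(2.0)))

def silu(z):
    return z / (1.0 + math.exp(-z))

# Hypothetical activation dictionary; the paper's actual choice is not stated here.
ACTIVATIONS = [relu, gelu, silu]

def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def softmax(z):
    m = max(z)
    e = [math.exp(t - m) for t in z]
    s = sum(e)
    return [t / s for t in e]

def mixed_ffn(x, W1, W2, coeffs):
    """y = W2 * sum_k coeffs[k] * act_k(W1 x), with W1 and W2 shared by all activations."""
    h = matvec(W1, x)
    mixed = [sum(c * act(hj) for c, act in zip(coeffs, ACTIVATIONS)) for hj in h]
    return matvec(W2, mixed)

def la_ffn(x, W1, W2, alpha):
    """Learnable activation: the same coefficients alpha for every token."""
    return mixed_ffn(x, W1, W2, alpha)

def moa_ffn(x, W1, W2, G):
    """Mixture of activations: coefficients depend on the token x.
    Assumption: gates are softmax(G x), one scalar gate per activation."""
    gates = softmax(matvec(G, x))
    return mixed_ffn(x, W1, W2, gates)
```

With `alpha = [1, 0, 0]` the LA layer reduces to a plain ReLU FFN, and with `G` set to zeros the MoA gates become uniform, so MoA reduces to an LA layer with equal coefficients. This mirrors the containment ordering described in the abstract.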


Symbolic Regression via Neural Networks

arXiv.org Machine Learning

Machine learning - specifically deep learning - techniques have shown their capabilities in approximating dynamics from data, but a shortcoming of traditional deep learning is that there is little insight into the underlying mapping beyond its numerical output for a given input. This limits their utility in analysis beyond simple prediction. Simultaneously, a number of strategies exist which identify models based on a fixed dictionary of basis functions, but most either require some intuition or insight about the system, or are susceptible to overfitting or a lack of parsimony. Here we present a novel approach that combines the flexibility and accuracy of deep learning approaches with the utility of symbolic solutions: a deep neural network that generates a symbolic expression for the governing equations. We first describe the architecture for our model, then show the accuracy of our algorithm across a range of classical dynamical systems.

The dynamics of quantities of interest are widely modeled as differential equations, often derived from first principles. However, this is not always possible, especially when [...]. The identification of models from data has seen significant advances with the advent of machine learning. While deep neural networks have enabled sufficient accuracy in forecasting dynamic data with unprecedented versatility, the models they represent lack closed-form expressions that can be conducive to interpretation and analysis.

A number of authors have approached system identification by fitting coefficients of a linear combination of basis functions, dating at least back to Crutchfield and McNamara [3]. The set of basis functions typically includes nonlinear terms, for example terms which would arise in a Taylor series expansion about the origin of the system [3-6] or a broader class of functions [7]. The coefficients of the basis functions are determined through comparison of the original data points with points from computed solutions to the fitted models.
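Fitting the coefficients of a linear combination of basis functions can be sketched as follows. This is a simplified illustration, not the method of the paper or of the cited works: it assumes derivative values are available and fits them by ordinary least squares, rather than comparing data points with computed solutions of the fitted model. The dictionary `BASIS`, the helper `solve`, and the logistic test system are all hypothetical choices.

```python
def solve(A, b):
    """Solve A c = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for k in range(col, n + 1):
                M[r][k] -= f * M[col][k]
    c = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][k] * c[k] for k in range(i + 1, n))
        c[i] = (M[i][n] - s) / M[i][i]
    return c

# Hypothetical dictionary: Taylor-style terms 1, x, x^2.
BASIS = [lambda x: 1.0, lambda x: x, lambda x: x * x]

def fit_coefficients(xs, dxs):
    """Least-squares fit of dx/dt ~ sum_k c_k * phi_k(x) via the normal equations."""
    Phi = [[phi(x) for phi in BASIS] for x in xs]
    K = len(BASIS)
    A = [[sum(row[i] * row[j] for row in Phi) for j in range(K)] for i in range(K)]
    b = [sum(row[i] * d for row, d in zip(Phi, dxs)) for i in range(K)]
    return solve(A, b)

# Synthetic data from the logistic equation dx/dt = x - x^2.
xs = [0.1 * i for i in range(1, 10)]
dxs = [x * (1.0 - x) for x in xs]
coeffs = fit_coefficients(xs, dxs)  # approximately [0, 1, -1]
```

Because the true right-hand side lies in the span of the dictionary, the fit recovers the coefficients up to rounding error. With noisy data or a larger dictionary, a sparsity-promoting step would typically be needed to keep the model parsimonious.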



Towards Lower Bounds on the Depth of ReLU Neural Networks

Neural Information Processing Systems

We contribute to a better understanding of the class of functions that is represented by a neural network with ReLU activations and a given architecture. Using techniques from mixed-integer optimization, polyhedral theory, and tropical geometry, we provide a mathematical counterbalance to the universal approximation theorems which suggest that a single hidden layer is sufficient for learning tasks. In particular, we investigate whether the class of exactly representable functions strictly increases by adding more layers (with no restrictions on size). This problem has potential impact on algorithmic and statistical aspects because of the insight it provides into the class of functions represented by neural hypothesis classes. However, to the best of our knowledge, this question has not been investigated in the neural network literature. We also present upper bounds on the sizes of neural networks required to represent functions in these neural hypothesis classes.
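As a small illustration of what "exactly representable" means here, the maximum of two numbers can be written exactly with one hidden ReLU layer, and the maximum of four numbers with two hidden layers by composition. This is a standard identity used only as an example; the function names are hypothetical and the sketch is not taken from the paper.

```python
def relu(z):
    return max(0.0, z)

def max2(x, y):
    """max(x, y) with one hidden layer of three ReLU units:
    relu(x - y) + y, where y = relu(y) - relu(-y)."""
    return relu(x - y) + relu(y) - relu(-y)

def max4(a, b, c, d):
    """max of four inputs with two hidden layers: a layer computing the two
    pairwise maxima, followed by a layer computing the maximum of those."""
    return max2(max2(a, b), max2(c, d))
```

The question raised in the abstract is whether such additional layers are actually necessary for exact representation, or whether a shallower network of sufficient size would do.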




Appendix 1 Interpretation using rank-1 Nyström approximation

Neural Information Processing Systems

The bound in Equation 5 of the main paper can be interpreted using a rank-1 Nyström approximation for $f(x_t, x_t)$. By holding $w$ fixed and maximizing over $q$ in the right-hand side of Equation 5, we get $q = f(w,w)^{\dagger} \sum_t y_t f(x_t, w)$, where $f(w,w)^{\dagger}$ indicates the pseudo-inverse. Typically the weight vector $w$, often called a "landmark", used in the Nyström approximation is chosen either by setting it to a random input or by more sophisticated schemes such as k-means clustering. In our case, we directly optimize the landmarks via Equation 6 in the main paper. To our knowledge, the only other work to do this is Fu [2014]. The code used in the main training loop of our algorithm is shown in Figure 1.
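The closed-form value of q and the rank-1 Nyström approximation with a single landmark can be sketched as below. This is an illustrative sketch only, not the training-loop code referenced as Figure 1: the RBF kernel, its `gamma` parameter, and the function names are assumptions, and the kernel is passed in as an argument so that any similarity function f can be substituted.

```python
import math

def rbf(a, b, gamma=0.5):
    """Hypothetical choice of f: an RBF kernel on vectors given as lists."""
    return math.exp(-gamma * sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def pinv_scalar(v, tol=1e-12):
    """Pseudo-inverse of a 1x1 matrix: 1/v, or 0 if v is (numerically) zero."""
    return 0.0 if abs(v) < tol else 1.0 / v

def nystrom_rank1(kernel, x, x_prime, w):
    """Rank-1 Nystrom approximation f(x, x') ~ f(x, w) f(w, w)^+ f(w, x')."""
    return kernel(x, w) * pinv_scalar(kernel(w, w)) * kernel(w, x_prime)

def optimal_q(kernel, xs, ys, w):
    """q = f(w, w)^+ * sum_t y_t f(x_t, w) for a fixed landmark w."""
    return pinv_scalar(kernel(w, w)) * sum(y * kernel(x, w) for x, y in zip(xs, ys))
```

For a kernel that is itself rank-1, such as the product of two scalars, the approximation is exact for any nonzero landmark. For a general kernel its quality depends on where the landmark is placed, which is why optimizing the landmark directly can matter.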



Algebraic Invariants of Lightning Self-Attention

arXiv.org Machine Learning

We study the polynomial coefficients of lightning self-attention as coordinates of an algebraic variety. We identify linear and nonlinear families of algebraic invariants, including Chow-type, low-rank, Veronese-type, and Sylvester resultant-based constraints.


Effective sample size approximations as entropy measures

arXiv.org Machine Learning

In this work, we analyze alternative effective sample size (ESS) metrics for importance sampling algorithms, and discuss a possible extended range of applications. We show the relationship between the ESS expressions used in the literature and two entropy families, the Rényi and Tsallis entropies. The Rényi entropy is connected to the Huggins-Roy ESS family introduced in \cite{Huggins15}. We prove that all the ESS functions included in the Huggins-Roy family fulfill all the desirable theoretical conditions. We analyze and remark on the connections with several other fields, such as the Hill numbers introduced in ecology, the Gini inequality coefficient employed in economics, and the Gini impurity index used mainly in machine learning, to name a few. Finally, by numerical simulations, we study the performance of different ESS expressions contained in the previous ESS families in terms of approximation of the theoretical ESS definition, and show the application of ESS formulas in a variable selection problem.
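A minimal numerical sketch of the kind of connection described: the widely used ESS approximation, one over the sum of squared normalized weights, equals the exponential of the Rényi entropy of order 2, and the Gini impurity equals the Tsallis entropy of order 2. The function names are hypothetical, and the sketch covers only these two standard special cases rather than the full families analyzed in the paper.

```python
import math

def normalize(weights):
    s = sum(weights)
    return [w / s for w in weights]

def ess_inverse_sum_squares(weights):
    """Common ESS approximation: 1 / sum of squared normalized weights."""
    wb = normalize(weights)
    return 1.0 / sum(w * w for w in wb)

def renyi_entropy(weights, alpha):
    """Renyi entropy of order alpha (alpha > 0, alpha != 1) of the normalized weights."""
    wb = normalize(weights)
    return math.log(sum(w ** alpha for w in wb)) / (1.0 - alpha)

def tsallis_entropy(weights, q):
    """Tsallis entropy of order q (q != 1); q = 2 gives the Gini impurity."""
    wb = normalize(weights)
    return (1.0 - sum(w ** q for w in wb)) / (q - 1.0)

weights = [0.5, 1.5, 1.0, 1.0]
ess = ess_inverse_sum_squares(weights)
ess_from_entropy = math.exp(renyi_entropy(weights, 2.0))  # same value as ess
```

Uniform weights over N samples give an ESS of N, and a single nonzero weight gives an ESS of 1, which are the two extreme cases an ESS measure is usually expected to reproduce.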